Intelligent online personal assistant with multi-turn dialog based on visual search

ABSTRACT

Systems, methods, and computer program products for identifying a relevant candidate product in an electronic marketplace. Embodiments perform a visual similarity comparison between candidate product image visual content and input query image visual content, process formal and informal natural language user inputs, and coordinate aggregated past user interactions with the marketplace stored in a knowledge graph. Visually similar items and their corresponding product categories, aspects, and aspect values can determine suggested candidate products without discernible delay during a multi-turn user dialog. The user can then refine the search for the most relevant items available for purchase by providing responses to machine-generated prompts that are based on the initial search results from visual, voice, and/or text inputs. An intelligent online personal assistant can thus guide a user to the most relevant candidate product more efficiently than existing search tools.

RELATED APPLICATIONS

This Application is a continuation of and claims the benefit of priority to U.S. application Ser. No. 17/221,367, filed on Apr. 2, 2021, which is a continuation of U.S. patent application Ser. No. 15/294,765, filed on Oct. 16, 2016, which is now U.S. Pat. No. 11,004,131, issued May 11, 2021, the disclosures of which are hereby incorporated by reference in their entirety.

This application is also related by subject matter to a commonly-assigned and simultaneously-filed application sharing a common specification: U.S. patent application Ser. No. 15/294,767, now U.S. Pat. No. 11,748,978, filed on Oct. 16, 2016, titled “Intelligent Online Personal Assistant With Offline Visual Search Database,” which is hereby incorporated by reference in its entirety. The following commonly-assigned applications are also each hereby incorporated by reference in their entireties:

-   U.S. patent application Ser. No. 15/238,666, filed on Aug. 16, 2016, titled “Selecting Next User Prompt Types In An Intelligent Online Personal Assistant Multi-Turn Dialog”,
-   U.S. patent application Ser. No. 15/238,675, filed on Aug. 16, 2016, titled “Intelligent Online Personal Assistant With Natural Language Understanding”,
-   U.S. patent application Ser. No. 15/238,660, filed on Aug. 16, 2016, titled “Generating Next User Prompts In An Intelligent Online Personal Assistant Multi-Turn Dialog”, and
-   U.S. patent application Ser. No. 15/238,679, filed on Aug. 16, 2016, “Knowledge Graph Construction For Intelligent Online Personal Assistant”.

The following articles are also each incorporated by reference in their entirety:

-   Jonathan Long, Evan Shelhamer, Trevor Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR, June 2015.
-   Shuai Zheng et al., “Conditional Random Fields as Recurrent Neural Networks”, IEEE International Conference on Computer Vision (ICCV), 2015.

BACKGROUND

Traditional searching is text-based rather than image-based or voice-based. Searching is overly time-consuming when too many irrelevant results must be presented, browsed, and rejected by a user. The technical limitations of conventional search tools make it difficult for a user to communicate search intent, for example by sharing photos of interesting products, to help start a search that may be refined by further user input, such as in a multi-turn dialog. As online searches balloon to billions of possible selectable products, comparison searching has become more important than ever, but current text-based solutions were not designed for this scale. Irrelevant results are often shown, and the best results are not brought out. Traditional forms of comparison searching (search+refinement+browse) are no longer useful.

BRIEF SUMMARY

In one example, an intelligent personal assistant system includes scalable artificial intelligence (AI) that permeates the fabric of existing messaging platforms to provide an intelligent online personal assistant (or “bot”). The system may leverage existing inventories and curated databases to provide intelligent, personalized answers in predictive turns of communication between a human user and an intelligent online personal assistant. One example of an intelligent personal assistant system includes a knowledge graph. Machine learning components may continuously identify and learn from user intents so that user identity and understanding is enhanced over time. The user experience thus provided is inspiring, intuitive, unique, and may be focused on the usage and behavioral patterns of certain age groups, such as millennials, for example.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed in the present document. In order more easily to identify the discussion of any particular element or act, the most significant digit or digits in a reference number refer to the figure number in which that element is first introduced.

FIG. 1 shows a networked system, according to some example embodiments.

FIG. 2 shows a general architecture of an intelligent personal assistant system, according to some example embodiments.

FIG. 3 shows components of a speech recognition component, according to some example embodiments.

FIG. 4 shows a representative software architecture, which may be used in conjunction with various hardware architectures described herein.

FIG. 5 shows components of a machine, according to some example embodiments, able to read instructions from a machine-readable medium (e.g., a computer-readable storage medium) and perform any one or more of the methodologies discussed herein.

FIG. 6 shows an example environment into which an intelligent online personal assistant can be deployed, according to some example embodiments.

FIG. 7 shows an overview of the intelligent personal assistant system processing natural language user inputs to generate an item recommendation in an electronic marketplace, according to some example embodiments.

FIG. 8 shows a visual search service that interacts with a knowledge graph service, according to some example embodiments.

FIG. 9 shows a visual product search that generates product suggestions, according to some example embodiments.

FIG. 10 shows a visual product search that generates product suggestions in response to user non-image input, according to some example embodiments.

FIG. 11 shows a visual product search that generates product suggestions in response to user category input, according to some example embodiments.

FIG. 12 shows a flowchart of a methodology for visual search service and knowledge graph service interaction, according to some example embodiments.

FIG. 13 shows an offline index generation service that interacts with an on-demand visual search service, according to some example embodiments.

FIG. 14 shows a flowchart of a methodology for offline index generation for visual search, according to some example embodiments.

DETAILED DESCRIPTION

“CARRIER SIGNAL” in this context refers to any intangible medium that is capable of storing, encoding, or carrying instructions for execution by the machine, and includes digital or analog communications signals or other intangible medium to facilitate communication of such instructions. Instructions may be transmitted or received over the network using a transmission medium via a network interface device and using any one of a number of well-known transfer protocols.

“CLIENT DEVICE” in this context refers to any machine that interfaces to a communications network to obtain resources from one or more server systems or other client devices. A client device may be, but is not limited to, a mobile phone, desktop computer, laptop, portable digital assistant (PDA), smart phone, tablet, ultrabook, netbook, multi-processor system, microprocessor-based or programmable consumer electronics, game console, set-top box, or any other communication device that a user may use to access a network.

“COMMUNICATIONS NETWORK” in this context refers to one or more portions of a network that may be an ad hoc network, an intranet, an extranet, a virtual private network (VPN), a local area network (LAN), a wireless LAN (WLAN), a wide area network (WAN), a wireless WAN (WWAN), a metropolitan area network (MAN), the Internet, a portion of the Internet, a portion of the Public Switched Telephone Network (PSTN), a plain old telephone service (POTS) network, a cellular telephone network, a wireless network, a Wi-Fi® network, another type of network, or a combination of two or more such networks. For example, a network or a portion of a network may include a wireless or cellular network and the coupling may be a Code Division Multiple Access (CDMA) connection, a Global System for Mobile communications (GSM) connection, or other type of cellular or wireless coupling. In this example, the coupling may implement any of a variety of types of data transfer technology, such as Single Carrier Radio Transmission Technology (1×RTT), Evolution-Data Optimized (EVDO) technology, General Packet Radio Service (GPRS) technology, Enhanced Data rates for GSM Evolution (EDGE) technology, Third Generation Partnership Project (3GPP) including 3G, fourth generation wireless (4G) networks, Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Worldwide Interoperability for Microwave Access (WiMAX), Long Term Evolution (LTE) standard, others defined by various standard setting organizations, other long range protocols, or other data transfer technology.

“COMPONENT” in this context refers to a device, physical entity or logic having boundaries defined by function or subroutine calls, branch points, application program interfaces (APIs), or other technologies that provide for the partitioning or modularization of particular processing or control functions. Components may be combined via their interfaces with other components to carry out a machine process. A component may be a packaged functional hardware unit designed for use with other components and a part of a program that usually performs a particular function of related functions. Components may constitute either software components (e.g., code embodied on a machine-readable medium) or hardware components. A “hardware component” is a tangible unit capable of performing certain operations and may be configured or arranged in a certain physical manner. In various example embodiments, one or more computer systems (e.g., a standalone computer system, a client computer system, or a server computer system) or one or more hardware components of a computer system (e.g., a processor or a group of processors) may be configured by software (e.g., an application or application portion) as a hardware component that operates to perform certain operations as described herein. A hardware component may also be implemented mechanically, electronically, or any suitable combination thereof. For example, a hardware component may include dedicated circuitry or logic that is permanently configured to perform certain operations. A hardware component may be a special-purpose processor, such as a Field-Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC). A hardware component may also include programmable logic or circuitry that is temporarily configured by software to perform certain operations. For example, a hardware component may include software executed by a general-purpose processor or other programmable processor.
Once configured by such software, hardware components become specific machines (or specific components of a machine) uniquely tailored to perform the configured functions and are no longer general-purpose processors. It will be appreciated that the decision to implement a hardware component mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software) may be driven by cost and time considerations. Accordingly, the phrase “hardware component” (or “hardware-implemented component”) should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired), or temporarily configured (e.g., programmed) to operate in a certain manner or to perform certain operations described herein. Considering embodiments in which hardware components are temporarily configured (e.g., programmed), each of the hardware components need not be configured or instantiated at any one instance in time. For example, where a hardware component comprises a general-purpose processor configured by software to become a special-purpose processor, the general-purpose processor may be configured as respectively different special-purpose processors (e.g., comprising different hardware components) at different times. Software accordingly configures a particular processor or processors, for example, to constitute a particular hardware component at one instance of time and to constitute a different hardware component at a different instance of time. Hardware components can provide information to, and receive information from, other hardware components. Accordingly, the described hardware components may be regarded as being communicatively coupled. Where multiple hardware components exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) between or among two or more of the hardware components.
In embodiments in which multiple hardware components are configured or instantiated at different times, communications between such hardware components may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple hardware components have access. For example, one hardware component may perform an operation and store the output of that operation in a memory device to which it is communicatively coupled. A further hardware component may then, at a later time, access the memory device to retrieve and process the stored output. Hardware components may also initiate communications with input or output devices, and can operate on a resource (e.g., a collection of information). The various operations of example methods described herein may be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented components that operate to perform one or more operations or functions described herein. As used herein, “processor-implemented component” refers to a hardware component implemented using one or more processors. Similarly, the methods described herein may be at least partially processor-implemented, with a particular processor or processors being an example of hardware. For example, at least some of the operations of a method may be performed by one or more processors or processor-implemented components.
Moreover, the one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS). For example, at least some of the operations may be performed by a group of computers (as examples of machines including processors), with these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., an Application Program Interface (API)). The performance of certain of the operations may be distributed among the processors, not only residing within a single machine, but deployed across a number of machines. In some example embodiments, the processors or processor-implemented components may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm). In other example embodiments, the processors or processor-implemented components may be distributed across a number of geographic locations.

“MACHINE-READABLE MEDIUM” in this context refers to a component, device or other tangible media able to store instructions and data temporarily or permanently and may include, but is not limited to, random-access memory (RAM), read-only memory (ROM), buffer memory, flash memory, optical media, magnetic media, cache memory, other types of storage (e.g., Electrically Erasable Programmable Read-Only Memory (EEPROM)) and/or any suitable combination thereof. The term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store instructions. The term “machine-readable medium” will also be taken to include any medium, or combination of multiple media, that is capable of storing instructions (e.g., code) for execution by a machine, such that the instructions, when executed by one or more processors of the machine, cause the machine to perform any one or more of the methodologies described herein. Accordingly, a “machine-readable medium” refers to a single storage apparatus or device, as well as “cloud-based” storage systems or storage networks that include multiple storage apparatus or devices. The term “machine-readable medium” excludes signals per se.

“PROCESSOR” in this context refers to any circuit or virtual circuit (a physical circuit emulated by logic executing on an actual processor) that manipulates data values according to control signals (e.g., “commands”, “op codes”, “machine code”, etc.) and which produces corresponding output signals that are applied to operate a machine. A processor may, for example, be a Central Processing Unit (CPU), a Reduced Instruction Set Computing (RISC) processor, a Complex Instruction Set Computing (CISC) processor, a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Radio-Frequency Integrated Circuit (RFIC) or any combination thereof. A processor may further be a multi-core processor having two or more independent processors (sometimes referred to as “cores”) that may execute instructions contemporaneously.

A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data as described below and in the drawings that form a part of this document: Copyright 2016, eBay Inc, All Rights Reserved.

The description that follows includes systems, methods, techniques, instruction sequences, and computing machine program products that embody illustrative embodiments of the disclosure. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide an understanding of various embodiments of the inventive subject matter. It will be evident, however, to those skilled in the art, that embodiments of the inventive subject matter may be practiced without these specific details. In general, well-known instruction instances, protocols, structures, and techniques are not necessarily shown in detail.

With reference to FIG. 1, an example embodiment of a high-level SaaS network architecture 100 is shown. A networked system 116 provides server-side functionality via a network 110 (e.g., the Internet or wide area network (WAN)) to a client device 108. A web client 102 and a programmatic client, in the example form of an application 104, are hosted and execute on the client device 108. The networked system 116 includes an application server 122, which in turn hosts an intelligent personal assistant system 106 that provides a number of functions and services to the application 104 that accesses the networked system 116. The application 104 also provides a number of interfaces described herein, which present output of the tracking and analysis operations to a user of the client device 108.

The client device 108 enables a user to access and interact with the networked system 116. For instance, the user provides input (e.g., touch screen input or alphanumeric input) to the client device 108, and the input is communicated to the networked system 116 via the network 110. In this instance, the networked system 116, in response to receiving the input from the user, communicates information back to the client device 108 via the network 110 to be presented to the user.

An Application Program Interface (API) server 118 and a web server 120 are coupled to, and provide programmatic and web interfaces respectively to, the application server 122. The application server 122 hosts an intelligent personal assistant system 106, which includes components or applications. The application server 122 is, in turn, shown to be coupled to a database server 124 that facilitates access to information storage repositories (e.g., a database/cloud 126). In an example embodiment, the database/cloud 126 includes storage devices that store information accessed and generated by the intelligent personal assistant system 106.

Additionally, a third party application 114, executing on a third party server 112, is shown as having programmatic access to the networked system 116 via the programmatic interface provided by the Application Program Interface (API) server 118. For example, the third party application 114, using information retrieved from the networked system 116, may support one or more features or functions on a website hosted by the third party.

Turning now specifically to the applications hosted by the client device 108, the web client 102 may access the various systems (e.g., intelligent personal assistant system 106) via the web interface supported by the web server 120. Similarly, the application 104 (e.g., an “app”) accesses the various services and functions provided by the intelligent personal assistant system 106 via the programmatic interface provided by the Application Program Interface (API) server 118. The application 104 may be, for example, an “app” executing on a client device 108, such as an iOS or Android OS application, to enable a user to access and input data on the networked system 116 in an off-line manner, and to perform batch-mode communications between the programmatic client application 104 and the networked system 116.

Further, while the SaaS network architecture 100 shown in FIG. 1 employs a client-server architecture, the present inventive subject matter is of course not limited to such an architecture, and could equally well find application in a distributed, or peer-to-peer, architecture system, for example. The intelligent personal assistant system 106 could also be implemented as a standalone software program, which does not necessarily have networking capabilities.

FIG. 2 is a block diagram showing the general architecture of an intelligent personal assistant system 106, according to some example embodiments. Specifically, the intelligent personal assistant system 106 is shown to include a front end component 202 (FE) by which the intelligent personal assistant system 106 communicates (e.g., over the network 110) with other systems within the SaaS network architecture 100. The front end component 202 can communicate with the fabric of existing messaging systems. As used herein, the term messaging fabric refers to a collection of APIs and services that can power third party platforms such as Facebook Messenger, Microsoft Cortana and other “bots”. In one example, a messaging fabric can support an online commerce ecosystem that allows users to interact with commercial intent. Output of the front end component 202 can be rendered in a display of a client device, such as the client device 108 in FIG. 1, as part of an interface with an intelligent personal assistant, or “bot”.

The front end component 202 of the intelligent personal assistant system 106 is coupled to a back end component 204 for the front end (BFF) that operates to link the front end component 202 with an artificial intelligence framework 128. The artificial intelligence framework 128 may include several components as discussed below. The data exchanged between various components and the function of each component may vary to some extent, depending on the particular implementation.

In one example of an intelligent personal assistant system 106, an AI orchestrator 206 orchestrates communication between components inside and outside the artificial intelligence framework 128. Input modalities for the AI orchestrator 206 may be derived from a computer vision component 208, a speech recognition component 210, and a text normalization component which may form part of the speech recognition component 210, for example. The computer vision component 208 may identify objects and attributes from visual input (e.g., a photo). The speech recognition component 210 may convert audio signals (e.g., spoken utterances) into text. A text normalization component may operate to perform input normalization, such as language normalization by rendering emoticons into text, for example. Other normalization is possible, such as orthographic normalization, foreign language normalization, conversational text normalization, and so forth.
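By way of illustration only, the emoticon-rendering form of language normalization mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions: the `EMOTICON_TEXT` table and the `normalize_text` function are hypothetical names introduced here, and the disclosure does not specify any particular mapping or implementation.

```python
import re

# Hypothetical emoticon-to-text table; the disclosure names no specific mapping.
EMOTICON_TEXT = {
    ":)": "happy",
    ":(": "sad",
    "<3": "love",
}


def normalize_text(utterance: str) -> str:
    """Render known emoticons as words, then collapse whitespace and lowercase."""
    for emoticon, word in EMOTICON_TEXT.items():
        # Pad with spaces so the rendered word never fuses with its neighbors.
        utterance = utterance.replace(emoticon, f" {word} ")
    # Collapse runs of whitespace introduced by the padding or by the user.
    return re.sub(r"\s+", " ", utterance).strip().lower()


if __name__ == "__main__":
    print(normalize_text("I <3 these shoes :)"))
```

Other normalization steps (orthographic, foreign language, conversational) could be chained after this one in the same manner.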

The artificial intelligence framework 128 further includes a natural language understanding or NLU component 214 that operates to extract user intent and various intent parameters. The NLU component 214 is described in further detail beginning with FIG. 8.

The artificial intelligence framework 128 further includes a dialog manager 216 that operates to understand a “completeness of specificity” (for example of an input, such as a search query or utterance) and decide on a next action type and a related parameter (e.g., “search” or “request further information from user”). For convenience, all user inputs in this description may be referred to as “utterances”, whether in text, voice, or image-related formats.
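As a non-limiting illustration, the decision between the two example action types can be sketched as a check on whether an utterance carries enough parameters to search. The names `ParsedUtterance`, `REQUIRED_PARAMETERS`, and `next_action`, and the choice of “category” as the required parameter, are assumptions made for this sketch and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical parameters an utterance must carry before a search is attempted.
REQUIRED_PARAMETERS = ("category",)


@dataclass
class ParsedUtterance:
    """Intent and intent parameters, as an NLU step might extract them."""
    intent: str
    parameters: Dict[str, str] = field(default_factory=dict)


def next_action(parsed: ParsedUtterance) -> Tuple[str, Optional[str]]:
    """Return (action type, related parameter) for one dialog turn."""
    for name in REQUIRED_PARAMETERS:
        if not parsed.parameters.get(name):
            # The utterance is not specific enough: prompt for the missing value.
            return ("request further information from user", name)
    return ("search", None)


if __name__ == "__main__":
    print(next_action(ParsedUtterance("shop", {"color": "red"})))
    print(next_action(ParsedUtterance("shop", {"category": "shoes"})))
```

A fuller implementation would presumably weigh many signals rather than a fixed list of required parameters; the fixed list simply keeps the sketch small.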

In one example, the dialog manager 216 operates in association with a context manager 218 and a Natural Language Generation (NLG) component 212. The context manager 218 manages the context and communication of a user with respect to the intelligent online personal assistant (or “bot”) and the assistant's associated artificial intelligence. The context manager 218 retains a short term history of user interactions. A longer term history of user preferences may be retained in an identity service 222, described below. Data entries in one or both of these histories may include the relevant intent and all parameters and all related results of a given input, bot interaction, or turn of communication, for example. The NLG component 212 operates to compose a natural language utterance out of an AI message to present to a user interacting with the intelligent bot.

A search component 220 is also included within the artificial intelligence framework 128. The search component 220 may have front and back end units. The back end unit may operate to manage item or product inventory and provide functions of searching against the inventory, optimizing towards a specific tuple of user intent and intent parameters. The search component 220 is designed to serve several billion queries per day globally against very large high quality inventories. The search component 220 can accommodate text, or Artificial Intelligence (AI) encoded voice and image inputs, and identify relevant inventory items to users based on explicit and derived query intents.

An identity service 222 component operates to manage user profiles, including explicit information in the form of user attributes (e.g., “name”, “age”, “gender”, “geolocation”) and implicit information in the form of “information distillates” such as “user interest” or “similar persona”, and so forth. The artificial intelligence framework 128 may comprise part of, or operate in association with, the identity service 222. The identity service 222 includes a set of policies, APIs, and services that centralize all user information, helping the artificial intelligence framework 128 to have “intelligent” insights into user intent. The identity service 222 can protect online retailers and users from fraud or malicious use of private information.

The identity service 222 of the present disclosure provides many advantages. The identity service 222 is a single central repository containing user identity and profile data. It may continuously enrich the user profile with new insights and updates. It uses account linking and identity federation to map relationships of a user with a company, household, other accounts (e.g., core account), as well as a user's social graph of people and relationships. The identity service 222 evolves a rich notification system that communicates all and only the information the user wants, at the times and through the media they choose.

In one example, the identity service 222 concentrates on unifying as much user information as possible in a central clearinghouse for search, AI, merchandising, and machine learning models to maximize each component's capability to deliver insights to each user. A single central repository contains user identity and profile data in a meticulously detailed schema. In an onboarding phase, the identity service 222 primes a user profile and understanding by mandatory authentication in a bot application. Any public information available from the source of authentication (e.g., social media) may be loaded. In sideboarding phases, the identity service 222 may augment the profile with information about the user that is gathered from public sources, user behaviors, interactions, and the explicit set of purposes the user tells the AI (e.g., shopping missions, inspirations, preferences). As the user interacts with the artificial intelligence framework 128, the identity service 222 gathers and infers more about the user, stores the explicit data and derived information, and updates probabilities and estimations of other statistical inferences. Over time, in profile enrichment phases, the identity service 222 also mines behavioral data such as clicks, impressions, and browse activities for derived information such as tastes, preferences, and shopping verticals. In identity federation and account linking phases, when communicated or inferred, the identity service 222 updates the user's household, employer, groups, affiliations, social graph, and other accounts, including shared accounts.

The functionalities of the artificial intelligence framework 128 can be grouped into multiple parts, for example decisioning and context parts. In one example, the decisioning part includes operations by the AI orchestrator 206, the NLU component 214, the dialog manager 216, the NLG component 212, the computer vision component 208, and the speech recognition component 210. The context part of the AI functionality relates to the parameters (implicit and explicit) around a user and the communicated intent (for example, towards a given inventory, or otherwise). In order to measure and improve AI quality over time, the artificial intelligence framework 128 may be trained using sample queries (e.g., a dev set) and tested on a different set of queries (e.g., an eval set), where both sets may be developed by human curation. Also, the artificial intelligence framework 128 may be trained on transaction and interaction flows defined by experienced curation specialists, or human tastemaker override rules 224. The flows and the logic encoded within the various components of the artificial intelligence framework 128 define what follow-up utterance or presentation (e.g., question, result set) is made by the intelligent assistant based on an identified user intent.
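The separation between a dev set used for training and a distinct eval set used for testing can be illustrated with a deliberately simple example. The word-count intent classifier below is a stand-in chosen for brevity and is an assumption of this sketch, as are the function names `train`, `predict`, and `accuracy`, the sample queries, and the default intent; the disclosure does not state which models or metrics are used.

```python
from collections import Counter, defaultdict


def train(dev_set):
    """Count how often each word co-occurs with each intent in the dev set."""
    counts = defaultdict(Counter)
    for query, intent in dev_set:
        for word in query.lower().split():
            counts[word][intent] += 1
    return counts


def predict(counts, query, default="browse"):
    """Pick the intent with the most word votes; fall back to a default intent."""
    votes = Counter()
    for word in query.lower().split():
        votes.update(counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else default


def accuracy(counts, eval_set):
    """Fraction of held-out eval queries whose predicted intent matches the label."""
    correct = sum(predict(counts, query) == intent for query, intent in eval_set)
    return correct / len(eval_set)


if __name__ == "__main__":
    # Hypothetical human-curated query sets; the eval queries are not used in training.
    dev = [("find red shoes", "search"),
           ("compare these phones", "compare"),
           ("find a blue dress", "search")]
    held_out = [("find a dress", "search"),
                ("compare laptops", "compare"),
                ("hello", "search")]
    print(accuracy(train(dev), held_out))
```

Keeping the eval queries out of training is what makes the measured accuracy a meaningful indicator of quality over time.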

Reference is made further above to example input modalities of the intelligent online personal assistant or bot in an intelligent personal assistant system 106. The intelligent personal assistant system 106 seeks to understand a user's intent (e.g., targeted search, compare, shop/browse, and so forth) and any mandatory parameters (e.g., product, product category, item, and so forth) and/or optional parameters (e.g., explicit information such as attributes of item/product, occasion, and so forth) as well as implicit information (e.g., geolocation, personal preferences, age, and gender, and so forth) and respond to the user with a well thought out or “intelligent” response. Explicit input modalities may include text, speech, and visual input and can be enriched with implicit knowledge of the user (e.g., geolocation, previous browse history, and so forth). Output modalities can include text (such as speech, or natural language sentences, or product-relevant information) and images on the screen of a smart device, e.g., client device 108. Input modalities thus refer to the different ways users can communicate with the bot. Input modalities can also include keyboard or mouse navigation, touch-sensitive gestures, and so forth.

In relation to a modality for the computer vision component 208, a photograph can often represent what a user is looking for better than text. The user may not know what an item is called, or it may be hard or even impossible to use text for fine detailed information that only an expert may know, for example a complicated pattern in apparel or a certain style in furniture. Moreover, it is inconvenient to type complex text queries on mobile phones, and long text queries typically have poor recall. Thus, key functionalities of the computer vision component 208 may include object localization, object recognition, optical character recognition (OCR) and matching against inventory based on visual cues from an image or video. A bot enabled with computer vision is advantageous when running on a mobile device which has a built-in camera. Powerful deep neural networks can be used to enable computer vision applications.

In one example, the dialog manager 216 has as sub-components the context manager 218 and the NLG component 212. As mentioned above, the dialog manager 216 operates to understand the “completeness of specificity” and to decide on a next action type and parameter (e.g., “search” or “request further information from user”). The context manager 218 operates to manage the context and communication of a given user towards the bot and its AI. The context manager 218 comprises two parts: a long term history and a short term memory. Each context manager entry may describe the relevant intent and all parameters and all related results. The context is towards the inventory, as well as towards other, future sources of knowledge. The NLG component 212 operates to compose a natural language utterance out of an AI message to present to a user interacting with the intelligent bot.
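The division of labor between the context manager and the dialog manager can be illustrated with a minimal Python sketch. It is not the described implementation: the class and function names, the fixed short-term window size, and the list of required parameters are all assumptions introduced here for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ContextEntry:
    """One context entry: the intent, its parameters, and related results."""
    intent: str
    parameters: dict
    results: list = field(default_factory=list)


class ContextManager:
    """Hypothetical two-part store: short term memory plus long term history."""

    def __init__(self, short_term_size=5):
        # Short term memory keeps only the most recent entries (assumed window).
        self.short_term = deque(maxlen=short_term_size)
        # Long term history keeps every entry.
        self.long_term = []

    def record(self, entry):
        self.short_term.append(entry)
        self.long_term.append(entry)

    def merged_parameters(self):
        """Combine parameters from recent entries; later entries override earlier ones."""
        merged = {}
        for entry in self.short_term:
            merged.update(entry.parameters)
        return merged


def next_action(parameters, required=("category", "color")):
    """Decide the next action type and parameter.

    The `required` tuple is an illustrative stand-in for the
    "completeness of specificity" check.
    """
    missing = [name for name in required if name not in parameters]
    if missing:
        return "request_info", missing
    return "search", dict(parameters)
```

In this sketch, a first turn that supplies only a category yields `("request_info", ["color"])`, and a later turn that supplies the color yields a `"search"` action with the merged parameters.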

Fluent, natural, informative, and even entertaining dialog between man and machine is a difficult technical problem that has been studied for much of the past century, yet is still considered unsolved. However, recent developments in AI have produced useful dialog systems such as Siri™ and Alexa™.

In an ecommerce example of an intelligent bot, an initial very helpful element in seeking to solve this problem is to leverage enormous sets of e-commerce data. Some of this data may be retained in proprietary databases or in the cloud, e.g., database/cloud 126. Statistics about this data may be communicated to dialog manager 216 from the search component 220 as context. The artificial intelligence framework 128 may act directly upon utterances from the user, which may be run through speech recognition component 210, then the NLU component 214, and then passed to context manager 218 as semi-parsed data. The NLG component 212 may thus help the dialog manager 216 generate human-like questions and responses in text or speech to the user. The context manager 218 maintains the coherency of multi-turn and long term discourse between the user and the artificial intelligence framework 128.

Discrimination is recommended when polling a vast e-commerce dataset so that only relevant, useful information is retrieved. In one example, the artificial intelligence framework 128 uses results from the search component 220 and intelligence within the search component 220 to provide this information. This information may be combined with the history of interaction from the context manager 218. The artificial intelligence framework 128 then may decide on the next turn of dialog, e.g., whether it should be a question, or a “grounding statement” to validate, for example, an existing understanding or user intent, or an item recommendation (or, for example, any combination of all three). These decisions may be made by a combination of the dataset, the chat history of the user, and a model of the user's understanding. The NLG component 212 may generate language for a textual or spoken reply to the user based on these decisions.
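A minimal Python sketch of such a next-turn decision follows. The inputs (a result count from the search component and a confidence score for the understood intent), the thresholds, and the function name are assumptions made for illustration; the description does not specify how the decision is computed.

```python
def choose_next_turn(result_count, intent_confidence,
                     max_results=20, min_confidence=0.6):
    """Pick the next dialog turn type from simple, assumed signals.

    result_count:      number of items returned by the search (assumed input).
    intent_confidence: score in [0, 1] for the understood intent (assumed input).
    """
    if intent_confidence < min_confidence:
        # Validate the existing understanding before going further.
        return "grounding_statement"
    if result_count == 0 or result_count > max_results:
        # Too few or too many results: ask the user to refine.
        return "question"
    # A manageable result set: recommend items.
    return "recommendation"
```

A production system could equally combine several turn types in one reply, as the paragraph above notes; this sketch returns a single type to keep the control flow visible.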

Technical solutions provided by the present inventive subject matter allow users to communicate with an intelligent online personal assistant in a natural conversation. The assistant is efficient as over time it increasingly understands specific user preferences and is knowledgeable about a wide range of products. Through a variety of convenient input modalities, a user can share photos, or use voice or text, and the assisted user experience may be akin to talking to a trusted, knowledgeable human shopping assistant in a high-end store, for example.

Conventionally, the approach and data used by online shopping systems aim at a faceless demographic group of buyers with blunt, simplified assumptions to maximize short-term revenue. Conventional sites and apps do not understand how, why, and when users want to be notified. Notifications may be annoying, inappropriate, and impersonal, oblivious to each user's preferences. One person is not the same as a single account. People share accounts and devices. Passwords make platforms neither safe nor easy to use. Problems of weak online identity and the ignoring of environmental signals (such as device, location, notification after anomalous behavior) make it easy to conduct fraud in the marketplace.

With reference to FIG. 3, the illustrated components of the speech recognition component 210 are now described. A feature extraction component operates to convert a raw audio waveform into a vector of numbers, of some dimension, that represents the sound. This component uses deep learning to project the raw signal into a high-dimensional semantic space. An acoustic model component operates to host a statistical model of speech units, such as phonemes and allophones. These can include Gaussian Mixture Models (GMM) although the use of Deep Neural Networks is possible. A language model component uses statistical models of grammar to define how words are put together in a sentence. Such models can include n-gram-based models or Deep Neural Networks built on top of word embeddings. A speech-to-text (STT) decoder component may convert a speech utterance into a sequence of words, typically leveraging features derived from a raw signal using the feature extraction component, the acoustic model component, and the language model component in a Hidden Markov Model (HMM) framework to derive word sequences from feature sequences. In one example, a speech-to-text service in the cloud (e.g., database/cloud 126) has these components deployed in a cloud framework with an API that allows audio samples to be posted for speech utterances and to retrieve the corresponding word sequence. Control parameters are available to customize or influence the speech-to-text process.
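As an illustration of the n-gram-based language models mentioned above, the following Python sketch trains a bigram model with add-one (Laplace) smoothing. The smoothing choice, the sentence boundary tokens, and the function names are assumptions for illustration rather than details taken from this description.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized sentences."""
    context_counts, bigram_counts = Counter(), Counter()
    vocabulary = {"</s>"}
    for sentence in sentences:
        words = sentence.lower().split()
        vocabulary.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        # Every token except the last acts as a left context.
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, len(vocabulary)


def bigram_probability(model, prev, word):
    """Add-one smoothed estimate of P(word | prev)."""
    context_counts, bigram_counts, vocab_size = model
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)
```

Retraining such a model on sentences drawn from a new inventory category is one simple way to picture how a language model could be adapted to a target domain.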

In one example of an artificial intelligence framework 128, two additional parts for the speech recognition component 210 are provided, a speaker adaptation component and a Language Model (LM) adaptation component. The speaker adaptation component allows clients of an STT system (e.g., speech recognition component 210) to customize the feature extraction component and/or the acoustic model component for each speaker/user. This can be important because most speech-to-text systems are trained on data from a representative set of speakers from a target region and typically the accuracy of the system depends heavily on how well the target speaker matches the speakers in the training pool. The speaker adaptation component allows the speech recognition component 210 (and consequently the artificial intelligence framework 128) to be robust to speaker variations by continuously learning the idiosyncrasies of a user's intonation, pronunciation, accent, and other speech factors, and applying these to the speech-dependent components, e.g., the feature extraction component and the acoustic model component. While this approach may require a small voice profile to be created and persisted for each speaker, the potential benefits of accuracy generally far outweigh the storage drawbacks.

The LM adaptation component operates to customize the language model component and the speech-to-text vocabulary with new words and representative sentences from a target domain, for example, inventory categories or user personas. This capability allows the artificial intelligence framework 128 to be scalable as new categories and personas are supported.

FIG. 3 also shows a flow sequence 302 for text normalization in an artificial intelligence framework 128. A text normalization component performing the flow sequence 302 is included in the speech recognition component 210 in one example. Key functionalities in the flow sequence 302 include orthographic normalization (to handle punctuation, numbers, case, and so forth), conversational text normalization (to handle informal chat-type text with acronyms, abbreviations, incomplete fragments, slang, and so forth), and machine translation (to convert a normalized sequence of foreign-language words into a sequence of words in an operating language, including but not limited to English for example).
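The first two stages of such a normalization flow can be sketched in Python as follows. The shorthand table and function names are hypothetical, and the machine translation stage is deliberately omitted because it would depend on an external translation model.

```python
import re

# Hypothetical lookup table for chat-style shorthand; a real system would
# rely on a much larger, likely learned, resource.
CHAT_EXPANSIONS = {"u": "you", "pls": "please", "thx": "thanks", "w/": "with"}


def orthographic_normalize(text):
    """Lower-case the text, drop most punctuation, and collapse whitespace."""
    text = text.lower()
    # Keep word characters, whitespace, slashes, and apostrophes.
    text = re.sub(r"[^\w\s/']", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def conversational_normalize(text):
    """Expand known abbreviations token by token."""
    return " ".join(CHAT_EXPANSIONS.get(token, token) for token in text.split())


def normalize(text):
    """Apply orthographic normalization, then conversational normalization."""
    return conversational_normalize(orthographic_normalize(text))
```

For example, `normalize("Pls show me SHOES w/ laces, thx!!")` returns `"please show me shoes with laces thanks"`.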

The artificial intelligence framework 128 facilitates modern communications. Millennials for example often want to communicate via photos, voice, and text. The technical ability of the artificial intelligence framework 128 to use multiple modalities allows the communication of intent instead of just text. The artificial intelligence framework 128 provides technical solutions and is efficient. It is faster to interact with a smart personal assistant using voice commands or photos than text in many instances.

FIG. 4 is a block diagram illustrating an example software architecture 406, which may be used in conjunction with various hardware architectures described herein. FIG. 4 is a non-limiting example of a software architecture and it will be appreciated that many other architectures may be implemented to facilitate the functionality described herein. The software architecture 406 may execute on hardware such as machine 500 of FIG. 5 that includes, among other things, processors 504, memory 514, and input/output (I/O) components 518. A representative hardware layer 452 is illustrated and can represent, for example, the machine 500 of FIG. 5. The representative hardware layer 452 includes a processing unit 454 having associated executable instructions 404. Executable instructions 404 represent the executable instructions of the software architecture 406, including implementation of the methods, components and so forth described herein. The hardware layer 452 also includes memory and/or storage modules (memory/storage 456), which also have executable instructions 404. The hardware layer 452 may also comprise other hardware 458.

In the example architecture of FIG. 4, the software architecture 406 may be conceptualized as a stack of layers where each layer provides particular functionality. For example, the software architecture 406 may include layers such as an operating system 402, libraries 420, applications 416 and a presentation layer 414. Operationally, the applications 416 and/or other components within the layers may invoke application programming interface (API) calls 408 through the software stack and receive a response to the API calls 408. The layers illustrated are representative in nature and not all software architectures have all layers. For example, some mobile or special purpose operating systems may not provide a frameworks/middleware 418, while others may provide such a layer. Other software architectures may include additional or different layers.

The operating system 402 may manage hardware resources and provide common services. The operating system 402 may include, for example, a kernel 422, services 424 and drivers 426. The kernel 422 may act as an abstraction layer between the hardware and the other software layers. For example, the kernel 422 may be responsible for memory management, processor management (e.g., scheduling), component management, networking, security settings, and so on. The services 424 may provide other common services for the other software layers. The drivers 426 are responsible for controlling or interfacing with the underlying hardware. For instance, the drivers 426 may include display drivers, camera drivers, Bluetooth® drivers, flash memory drivers, serial communication drivers (e.g., Universal Serial Bus (USB) drivers), Wi-Fi® drivers, audio drivers, power management drivers, and so forth depending on the hardware configuration.

The libraries 420 provide a common infrastructure that is used by the applications 416 and/or other components and/or layers. The libraries 420 may provide functionality that allows other software components to perform tasks in an easier fashion than to interface directly with the underlying operating system 402 functionality (e.g., kernel 422, services 424, and/or drivers 426). The libraries 420 may include system libraries 444 (e.g., C standard library) that may provide functions such as memory allocation functions, string manipulation functions, mathematical functions, and the like. In addition, the libraries 420 may include API libraries 446 such as media libraries (e.g., libraries to support presentation and manipulation of various known media formats such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG), graphics libraries (e.g., an OpenGL framework that may be used to render 2D and 3D graphic content on a display), database libraries (e.g., SQLite that may provide various relational database functions), web libraries (e.g., WebKit that may provide web browsing functionality), and the like. The libraries 420 may also include a wide variety of other libraries 448 to provide many other APIs to the applications 416 and other software components/modules.

The frameworks/middleware 418 (also sometimes referred to as middleware) may provide a higher-level common infrastructure that may be used by the applications 416 and/or other software components/modules. For example, the frameworks/middleware 418 may provide various graphic user interface (GUI) functions, high-level resource management, high-level location services, and so forth. The frameworks/middleware 418 may provide a broad spectrum of other APIs that may be utilized by the applications 416 and/or other software components/modules, some of which may be specific to a particular operating system or platform.

The applications 416 include built-in applications 438 and/or third-party applications 440. Examples of representative built-in applications 438 may include, but are not limited to, a contacts application, a browser application, a book reader application, a location application, a media application, a messaging application, and/or a game application. Third-party applications 440 may include any application developed using the ANDROID™ or IOS™ software development kit (SDK) by an entity other than the vendor of the particular platform, and may be mobile software running on a mobile operating system such as IOS™, ANDROID™, WINDOWS® Phone, or other mobile operating systems. The third-party applications 440 may invoke the API calls 408 provided by the mobile operating system (such as operating system 402) to facilitate functionality described herein.

The applications 416 may use built-in operating system functions (e.g., kernel 422, services 424 and/or drivers 426), libraries 420, and frameworks/middleware 418 to create user interfaces to interact with users of the system. Alternatively, or additionally, in some systems interactions with a user may occur through a presentation layer, such as presentation layer 414. In these systems, the application/component “logic” can be separated from the aspects of the application/component that interact with a user.

Some software architectures use virtual machines. In the example of FIG. 4, this is illustrated by a virtual machine 410. The virtual machine 410 creates a software environment where applications/components can execute as if they were executing on a hardware machine (such as the machine 500 of FIG. 5, for example). The virtual machine 410 is hosted by a host operating system (operating system 402 in FIG. 4) and typically, although not always, has a virtual machine monitor 460, which manages the operation of the virtual machine as well as the interface with the host operating system (e.g., operating system 402). A software architecture executes within the virtual machine 410, such as an operating system (OS) 436, libraries 434, frameworks 432, applications 430 and/or presentation layer 428. These layers of software architecture executing within the virtual machine 410 can be the same as corresponding layers previously described or may be different.

FIG. 5 is a block diagram illustrating components of a machine 500, according to some example embodiments, which is able to read instructions from a machine-readable medium (e.g., a machine-readable storage medium) and perform any one or more of the methodologies discussed herein. Specifically, FIG. 5 shows a diagrammatic representation of the machine 500 in the example form of a computer system, within which instructions 510 (e.g., software, a program, an application, an applet, an app, or other executable code) for causing the machine 500 to perform any one or more of the methodologies discussed herein may be executed. As such, the instructions may be used to implement modules or components described herein. The instructions transform the general, non-programmed machine into a particular machine programmed to carry out the described and illustrated functions in the manner described. In alternative embodiments, the machine 500 operates as a standalone device or may be coupled (e.g., networked) to other machines. In a networked deployment, the machine 500 may operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine 500 may comprise, but is not limited to, a server computer, a client computer, a personal computer (PC), a tablet computer, a laptop computer, a netbook, a set-top box (STB), a personal digital assistant (PDA), an entertainment media system, a cellular telephone, a smart phone, a mobile device, a wearable device (e.g., a smart watch), a smart home device (e.g., a smart appliance), other smart devices, a web appliance, a network router, a network switch, a network bridge, or any machine capable of executing the instructions 510, sequentially or otherwise, that specify actions to be taken by machine 500. Further, while only a single machine 500 is illustrated, the term “machine” will also be taken to include a collection of machines that individually or jointly execute the instructions 510 to perform any one or more of the methodologies discussed herein.

The machine 500 may include processors 504, memory/storage 506, and I/O components 518, which may be configured to communicate with each other such as via a bus 502. The memory/storage 506 may include a memory 514, such as a main memory, or other memory storage, and a storage unit 516, both accessible to the processors 504 such as via the bus 502. The storage unit 516 and memory 514 store the instructions 510 embodying any one or more of the methodologies or functions described herein. The instructions 510 may also reside, completely or partially, within the memory 514, within the storage unit 516, within at least one of the processors 504 (e.g., within the processor's cache memory), or any suitable combination thereof, during execution thereof by the machine 500. Accordingly, the memory 514, the storage unit 516, and the memory of processors 504 are examples of machine-readable media.

The I/O components 518 may include a wide variety of components to receive input, provide output, produce output, transmit information, exchange information, capture measurements, and so on. The specific I/O components 518 that are included in a particular machine will depend on the type of machine. For example, portable machines such as mobile phones will likely include a touch input device or other such input mechanisms, while a headless server machine will likely not include such a touch input device. It will be appreciated that the I/O components 518 may include many other components that are not shown in FIG. 5. The I/O components 518 are grouped according to functionality merely for simplifying the following discussion and the grouping is in no way limiting. In various example embodiments, the I/O components 518 may include output components 526 and input components 528. The output components 526 may include visual components (e.g., a display such as a plasma display panel (PDP), a light emitting diode (LED) display, a liquid crystal display (LCD), a projector, or a cathode ray tube (CRT)), acoustic components (e.g., speakers), haptic components (e.g., a vibratory motor, resistance mechanisms), other signal generators, and so forth. The input components 528 may include alphanumeric input components (e.g., a keyboard, a touch screen configured to receive alphanumeric input, a photo-optical keyboard, or other alphanumeric input components), point-based input components (e.g., a mouse, a touchpad, a trackball, a joystick, a motion sensor, or other pointing instrument), tactile input components (e.g., a physical button, a touch screen that provides location and/or force of touches or touch gestures, or other tactile input components), audio input components (e.g., a microphone), and the like.

In further example embodiments, the I/O components 518 may include biometric components 530, motion components 534, environment components 536, or position components 538 among a wide array of other components. For example, the biometric components 530 may include components to detect expressions (e.g., hand expressions, facial expressions, vocal expressions, body gestures, or eye tracking), measure biosignals (e.g., blood pressure, heart rate, body temperature, perspiration, or brain waves), identify a person (e.g., voice identification, retinal identification, facial identification, fingerprint identification, or electroencephalogram based identification), and the like. The motion components 534 may include acceleration sensor components (e.g., accelerometer), gravitation sensor components, rotation sensor components (e.g., gyroscope), and so forth. The environment components 536 may include, for example, illumination sensor components (e.g., photometer), temperature sensor components (e.g., one or more thermometers that detect ambient temperature), humidity sensor components, pressure sensor components (e.g., barometer), acoustic sensor components (e.g., one or more microphones that detect background noise), proximity sensor components (e.g., infrared sensors that detect nearby objects), gas sensors (e.g., gas detection sensors to detect concentrations of hazardous gases for safety or to measure pollutants in the atmosphere), or other components that may provide indications, measurements, or signals corresponding to a surrounding physical environment. The position components 538 may include location sensor components (e.g., a Global Positioning System (GPS) receiver component), altitude sensor components (e.g., altimeters or barometers that detect air pressure from which altitude may be derived), orientation sensor components (e.g., magnetometers), and the like.

Communication may be implemented using a wide variety of technologies. The I/O components 518 may include communication components 540 operable to couple the machine 500 to a network 532 or devices 520 via coupling 522 and coupling 524 respectively. For example, the communication components 540 may include a network interface component or other suitable device to interface with the network 532. In further examples, communication components 540 may include wired communication components, wireless communication components, cellular communication components, Near Field Communication (NFC) components, Bluetooth® components (e.g., Bluetooth® Low Energy), Wi-Fi® components, and other communication components to provide communication via other modalities. The devices 520 may be another machine or any of a wide variety of peripheral devices (e.g., a peripheral device coupled via a Universal Serial Bus (USB)).

Moreover, the communication components 540 may detect identifiers or include components operable to detect identifiers. For example, the communication components 540 may include Radio Frequency Identification (RFID) tag reader components, NFC smart tag detection components, optical reader components (e.g., an optical sensor to detect one-dimensional bar codes such as Universal Product Code (UPC) bar code, multi-dimensional bar codes such as Quick Response (QR) code, Aztec code, Data Matrix, Dataglyph, MaxiCode, PDF417, Ultra Code, UCC RSS-2D bar code, and other optical codes), or acoustic detection components (e.g., microphones to identify tagged audio signals). In addition, a variety of information may be derived via the communication components 540, such as location via Internet Protocol (IP) geo-location, location via Wi-Fi® signal triangulation, location via detecting an NFC beacon signal that may indicate a particular location, and so forth.

With reference now to FIG. 6, an example environment 600 into which an intelligent online personal assistant provided by the intelligent personal assistant system 106 can be deployed is shown. At the center of the environment 600, the intelligent bot 602 with AI appears. The bot leverages the computer vision component 208, the speech recognition component 210, the NLU component 214, the dialog manager 216, the NLG component 212, the search component 220, and identity service 222 to engage users in efficient, interesting, and effective dialog to decode their intent and deliver personalized results.

An associated application 604 can showcase the bot 602's full power and intelligence with compelling mobile design capabilities and elements. The fabric 606 integrates with Facebook Messenger™, Skype™, and Cortana™ (for example) to enable users to transact where they are already spending time. A smart notifications 610 platform delivers the right information at the right time via any number of channels (e.g., SMS, push notification, email, messaging) to users to encourage them to engage with the bot 602 and associated marketplaces. Communities 608 features enable users to connect, engage, and interact with their friends, tastemakers, and brands using the same messaging systems in which they already spend most of their time. Other features include group buying and gift buying. A rewards 612 platform incentivizes users to engage more deeply with the bot 602. Rewards can include deep discounts on products, access to unique inventory, and recognition in the app through scores, levels, etc. At marketing 614, a combination of traditional, social and other marketing is performed to win the attention of some populations (e.g., millennials) in more personal ways. Conventional techniques can include merchandising, email, search engine optimization (SEO), and search engine marketing (SEM) as well as experimental techniques such as social ads, viral coupons, and more to target new and existing users.

FIG. 7 shows an overview of the intelligent personal assistant system 106 processing natural language user inputs to generate an item recommendation in an electronic marketplace. Although the intelligent personal assistant system 106 is not limited to this use scenario, it may be of particular utility in this situation. As previously described, any combination of text, image, and voice data may be received by the artificial intelligence framework 128. Image data may be processed by the computer vision component 208 to provide image attribute data. Voice data may be processed by the speech recognition component 210 into text.

All of these inputs and others may be provided to the NLU component 214 for analysis. The NLU component 214 may operate to parse user inputs and help determine the user intent and intent-related parameters. For example, the NLU component 214 may discern the dominant object of user interest, and a variety of attributes and attribute values related to that dominant object. The NLU component 214 may also determine other parameters such as the user input type (e.g., a question or a statement) and targeted item recipients. The NLU component 214 may provide extracted data to the dialog manager 216, as well as the AI orchestrator 206 previously shown.

The NLU component 214 may generally transform formal and informal natural language user inputs into a more formal, machine-readable, structured representation of a user's query. That formalized query may be enhanced further by the dialog manager 216. In one scenario, the NLU component 214 processes a sequence of user inputs including an original query and further data provided by a user in response to machine-generated prompts from the dialog manager 216 in a multi-turn interactive dialog. This user-machine interaction may improve the efficiency and accuracy of one or more automated searches for the most relevant items available for purchase in an electronic marketplace. The searches may be performed by the search component 220.

Extracting user intent is very helpful for the AI bot in determining what further action is needed. In one ecommerce-related example, at the very highest level, user intent could be shopping, chit-chat, jokes, weather, etc. If the user intent is shopping, it could relate to the pursuit of a specific shopping mission, gifting an item for a target recipient other than the user, or just to browse an inventory of items available for purchase. Once the high level intent is identified, the artificial intelligence framework 128 is tasked with determining what the user is looking for; that is, is the need broad (e.g., shoes, dresses) or more specific (e.g., two pairs of new black Nike™ size 10 sneakers) or somewhere in between (e.g., black sneakers)?

In a novel and distinct improvement over the prior art in this field, the artificial intelligence framework 128 may map the user request to certain primary dimensions, such as categories, attributes, and attribute values, that best characterize the available items desired. This gives the bot the ability to engage with the user to further refine the search constraints if necessary. For example, if a user asks the bot for information relating to dresses, the top attributes that need specification might be color, material, and style. Further, over time, machine learning may add deeper semantics and wider “world knowledge” to the system, to better understand the user intent. For example, the input “I am looking for a dress for a wedding in June in Italy” means the dress should be appropriate for particular weather conditions at a given time and place, and should be appropriate for a formal occasion. Another example might include a user asking the bot for “gifts for my nephew”. The artificial intelligence framework 128 when trained will understand that gifting is a special type of intent, that the target recipient is male based on the meaning of “nephew”, and that attributes such as age, occasion, and hobbies/likes of the target recipient should be clarified.
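The mapping of a request onto categories, attributes, and attribute values can be pictured with the following Python sketch. The tiny vocabularies and the keyword-matching approach are assumptions made purely for illustration; the framework described here relies on trained natural language understanding rather than fixed word lists.

```python
# Hypothetical miniature vocabularies, for illustration only.
CATEGORIES = {"dress", "sneakers", "shoes"}
ATTRIBUTE_VALUES = {
    "color": {"black", "red", "blue"},
    "brand": {"nike"},
    "material": {"cotton", "silk"},
}
# Top attributes that might need specification for each category.
TOP_ATTRIBUTES = {
    "dress": ["color", "material", "style"],
    "sneakers": ["color", "brand", "size"],
    "shoes": ["color", "brand", "size"],
}


def parse_query(text):
    """Map a free-text request onto a category and attribute/value pairs."""
    query = {"category": None, "attributes": {}}
    for token in text.lower().split():
        if token in CATEGORIES and query["category"] is None:
            query["category"] = token
        for name, values in ATTRIBUTE_VALUES.items():
            if token in values:
                query["attributes"][name] = token
    return query


def unspecified_attributes(query):
    """List top attributes of the category that the user has not yet given."""
    wanted = TOP_ATTRIBUTES.get(query["category"], [])
    return [name for name in wanted if name not in query["attributes"]]
```

Here `parse_query("red dress")` leaves `material` and `style` unspecified, which are the attributes a follow-up prompt could ask about.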

FIG. 8 shows a visual search service 800 that interacts with a knowledge graph service 822, according to some example embodiments. The present inventors have recognized, among other things, that product searches in an electronic marketplace may be improved by combining a visual search with a knowledge graph based search. An electronic marketplace that processes millions of products per week may collect a vast amount of data about its inventory, which may be represented by knowledge graph entries.

However, existing text-based search tools may not always capitalize on that data as effectively as tools that perform visual searches or combine visual searches and knowledge graph based searches. This description therefore provides a method for processing natural language input so that a user can easily take advantage of both knowledge graph information (via a multi-turn dialog, for example) and the product image information conveyed by a visual query.

The visual search service 800 may, in one embodiment, comprise software instructions called for execution on at least one hardware-based processor by the AI orchestrator 206, which coordinates information within the artificial intelligence framework 128 as previously described. In one embodiment, the visual search service 800 may be part of the computer vision component 208, but in general the visual search service 800 may coordinate information between any number of components.

The visual search service 800 may receive an image query 802 originating from a user. The image query 802 may comprise one or more images the user believes will be helpful in finding a particular product. For simplicity, but not by limitation, this description may refer to image query 802 as comprising a single input query image. The input query image may comprise a photograph, a video frame, a sketch, or a diagram, for example. The input query image is typically a digital image file such as may be produced by a portable camera or smartphone, or such as may be copied from a web site or an electronic message.

The visual search service 800 may comprise or interact with a number of functional component blocks, which may each comprise a hardware processor implemented software program for example. A neural network (not shown) in the visual search service 800 for example may process the input query image. The neural network may comprise a fully convolutional neural network (FCN) as described in the previously cited article by Long et al. In another embodiment, the neural network may comprise a hybrid neural network (termed a CRF-RNN) including a fully convolutional neural network and a recurrent network (RNN) that includes conditional random fields (CRF) as described in the previously cited article by Zheng et al.

Images processed by the neural network may comprise an input query image from the image query 802 as well as any number of images associated with any number of candidate products in an electronic marketplace, for example. The neural network in the visual search service 800 may produce an image signature that concisely describes image content. In general, an image signature may numerically describe a number of image features and their relative dominance of overall image content. Each image signature may comprise a vector of binary numbers for example, also referred to as a binary hash. Any form of image signature may be considered to be within the scope of this description.

The visual search service 800 may generate metadata and image signatures from input query images. The visual search service 800 may also receive metadata and image signatures from product images, shown in block 814. Metadata may comprise for example a product identification (ID) number and a uniform resource locator (URL) for a product listing in the electronic marketplace.

The visual search service 800 may then calculate a visual similarity measure between images, such as between a particular candidate product image and the input query image. The visual similarity measure may be estimated by calculating a distance value between two image signatures. The distance may comprise a Hamming distance, by way of example but not limitation. A Hamming distance generally describes the number of bits that are different in two binary vectors. Similar images being compared may therefore have a smaller Hamming distance between them, and thus a higher visual similarity measure. The visual similarity measure is therefore useful as a search result score, e.g., for the candidate product at hand.
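By way of illustration only, the Hamming distance comparison described above can be sketched as follows. The sketch assumes each binary signature is packed into a Python integer; the function names and the linear mapping from distance to a similarity value in [0, 1] are assumptions made for this example and are not specified in this description.

```python
def hamming_distance(sig_a: int, sig_b: int) -> int:
    """Count the bit positions at which two binary signatures differ."""
    return bin(sig_a ^ sig_b).count("1")


def visual_similarity(sig_a: int, sig_b: int, n_bits: int) -> float:
    """Map a Hamming distance onto [0, 1]; 1.0 means identical signatures.

    The linear mapping is an illustrative assumption.
    """
    return 1.0 - hamming_distance(sig_a, sig_b) / n_bits


query = 0b10110010
candidate_near = 0b10110011  # differs from the query in one bit
candidate_far = 0b01001101   # differs from the query in all eight bits

near_score = visual_similarity(query, candidate_near, n_bits=8)
far_score = visual_similarity(query, candidate_far, n_bits=8)
```

The candidate whose signature differs in fewer bits receives the higher score, which is the property that lets the measure serve as a search result score.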

In one embodiment, the external data in block 814 is computed offline for some or all of the products or items available in an electronic marketplace, and is stored in a database. For a marketplace with a large number of products available, substantially real time computation of image signatures may not be computationally feasible. The visual search service 800 may not only be used to help a shopper find a relevant product in the electronic marketplace, but may be used for other external visual search tasks assigned by external tools.

In one embodiment, each product image previously provided by sellers in an electronic marketplace may be processed to generate an image signature that may be stored in the index 814. The processing may be performed offline to build a catalog of image signatures without interfering with ongoing “live” operations of the electronic marketplace.

Any approach for calculating the visual similarity measure may provide the search result score described. Visual search result scores for any number of candidate products may for example be generated via visual comparisons with an input query image as described above. The visual search result scores may determine the order in which ranked candidate products may be presented to a user in response to the image query 802. The end result of the visual search methodology described may comprise an output item list 818 that may correspond to available products in the electronic marketplace, for example.
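As a minimal sketch of how visual search result scores might order an output item list, the following example scores a handful of candidates against a query signature and keeps the best matches. The `Candidate` type, the `rank_candidates` function, the `top_k` cutoff, and the sample signatures are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    product_id: str  # hypothetical listing identifier
    signature: int   # binary image signature packed into an int


def rank_candidates(query_sig, candidates, n_bits, top_k=3):
    """Return (product_id, score) pairs, highest visual similarity first."""
    scored = []
    for c in candidates:
        distance = bin(query_sig ^ c.signature).count("1")  # Hamming distance
        scored.append((c.product_id, 1.0 - distance / n_bits))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


catalog = [
    Candidate("A", 0b11110001),
    Candidate("B", 0b00001111),
    Candidate("C", 0b11110000),
    Candidate("D", 0b11000000),
]
item_list = rank_candidates(0b11110000, catalog, n_bits=8)
```

Here the exact match "C" is listed first, followed by the candidates whose signatures differ from the query in one and two bits respectively.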

The results of a visual search may be factored into an overall composite search scheme in any number of different formulations. In one example, the visual search result score may be weighted by a user-adjustable weighting factor, and the remaining weight may be applied to scores from a leaf category prediction from the knowledge graph. FIG. 11 describes this type of composite weighting in more detail.
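One possible formulation of such composite weighting is a convex combination of the two scores. The sketch below assumes both scores are already normalized to the range 0 to 1; the function name `composite_score` and the parameter `visual_weight` are illustrative names, not terms defined in this description.

```python
def composite_score(visual_score: float,
                    category_score: float,
                    visual_weight: float = 0.5) -> float:
    """Blend a visual search score with a leaf category prediction score.

    visual_weight is the user-adjustable factor; the remaining weight
    (1 - visual_weight) is applied to the knowledge graph category score.
    """
    if not 0.0 <= visual_weight <= 1.0:
        raise ValueError("visual_weight must lie between 0 and 1")
    return visual_weight * visual_score + (1.0 - visual_weight) * category_score


# A candidate that matches well visually but only moderately by category.
blended = composite_score(visual_score=0.8, category_score=0.4, visual_weight=0.75)
```

Setting `visual_weight` to 1.0 reduces the blend to a purely visual ranking, while 0.0 ranks by the category prediction alone.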

The knowledge graph may have categories, aspects, and aspect values provided by sellers to help buyers find the product in the electronic marketplace inventory. Similarly, the knowledge graph may include popular categories, aspects, and aspect values that buyers have frequently used when searching for particular items. Categories may describe predetermined product groupings or sub-groupings provided by the electronic marketplace (e.g., “wine”, “shoes”, “paint”), or may be open-ended for seller definition. Categories may be branched, so that a particularly narrow sub-category may be regarded as a leaf category (e.g., “men's athletic shoes”) that may best narrow a given search to a small set of items best meeting a specified set of search constraints.

Aspects (also called attributes) may comprise descriptive parameters that may be specified by particular values, to provide further precise search keys for finding a particular product. Exemplary aspects or attributes may include but are not limited to “brand”, “color”, “style”, “material”, and “size”. Corresponding exemplary values may include “Nike”, “red”, “running”, “canvas”, and “ten”, for example. Knowledge graph construction and use is described further in the related applications previously incorporated by reference.
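The relationship among leaf categories, aspects, and aspect values can be sketched with a simple record type and filter. This is an illustrative data layout only; the `Listing` class, the `filter_listings` function, and the sample inventory are hypothetical and do not represent the knowledge graph structure described in the related applications.

```python
from dataclasses import dataclass, field


@dataclass
class Listing:
    product_id: str
    leaf_category: str
    aspects: dict = field(default_factory=dict)  # aspect name -> aspect value


def filter_listings(listings, leaf_category, required_aspects):
    """Keep listings in the leaf category whose aspects match every constraint."""
    return [
        listing for listing in listings
        if listing.leaf_category == leaf_category
        and all(listing.aspects.get(name) == value
                for name, value in required_aspects.items())
    ]


inventory = [
    Listing("101", "men's athletic shoes",
            {"brand": "Nike", "color": "red", "style": "running"}),
    Listing("102", "men's athletic shoes",
            {"brand": "Nike", "color": "blue", "style": "running"}),
    Listing("103", "wine", {"color": "red"}),
]
hits = filter_listings(inventory, "men's athletic shoes",
                       {"brand": "Nike", "color": "red"})
```

Each additional aspect value supplied as a constraint narrows the result set further, which is the sense in which aspects act as precise search keys.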

Some aspect values may be readily discernible via a visual search, while others may not. The combination of a visual search and a knowledge graph search may therefore enhance product searching significantly. Broadly speaking, each type of search will find some results, but the combination finds the intersection of the results. The final output is often astonishingly accurate and unexpectedly fast.

FIG. 9 shows a visual product search that generates product suggestions, according to some example embodiments. The user provides input query image 902 of a shoe. The image signature 814 generated from the input query image 902 may in itself denote that a particular category (“women's heels”) and aspect and aspect value tuples (“color:red” and “style:open toe”) are strongly indicated. The visual search may therefore be narrowed to those candidate products that share that category and those aspects and aspect values, rather than including all available candidate products. That is, a visual search alone may resolve categories, aspects, and aspect values just as a non-visual search may, depending on the inputs provided.

If the item list 818 includes several strong “hits”, based on the visual similarity measures obtained, the knowledge graph service 822 may merely confirm that the category, aspects, and aspect values are also strongly linked to the candidate products. The “bot” may thus elect to perform little or no further filtering of the candidate products in the item list. The “bot” may also elect to output several of the most strongly matching candidate products rather than generating a statement type or question type user prompt.

The number of candidate products displayed (three in this case, 908, 910, and 912) may thus generally be based on the amount of result filtering performed at block 824. That is, if the knowledge graph service 822 finds a strong indication that little re-ranking or filtering is required to reconcile the two types of searches, it may elect to output only a small number of the best matching candidate products. If the indication is sufficiently strong, as will be shown in FIG. 11, only a single matching candidate product may need to be displayed.

In general, searching and prompting may iterate until a stopping condition occurs. A user may provide input that clearly indicates an intent change, for example. A user may also submit an entirely new query to try a different approach to achieving the same or similar search intent; this is often the result of user frustration. And, of course, a user may be sufficiently pleased with the search results to make a final selection of a candidate product (or products) from those that have been systematically gathered during the multi-turn dialog.

FIG. 10 shows a visual product search that generates product suggestions in response to user non-image input, according to some example embodiments. The visual search service 800 performs a visual search based on the image signature it generates for input query image 1002 of a dress, in comparison with image signatures of candidate product images in database 814. The result is an internal ranked list of potentially matching items 818, as previously described.

However, unlike the previous example, the “bot” first acts to provide expertise via a statement prompt, a question prompt, and a candidate product image suggestion prompt in a dialog turn. Why? In this example, a user also provides a natural language utterance (e.g., text or voice converted to text) input question.

The parsed input question “How about this for a formal dinner party?” contains the term “formal” that may be recognized as an aspect value 820 that also appears in the knowledge graph, perhaps for the aspect “style”. The term “dinner party” may also be recognized, perhaps as an aspect value for the aspect “occasion” in the knowledge graph. The “bot” therefore identifies the user's intent from its processing of the natural language utterance. It further recognizes that the visual search results do not strongly correlate with candidate product images for “formal” and “dinner party” dresses. This mismatch would lead to a great deal of filtering, perhaps such that no acceptable candidate products may be found.

The knowledge graph service 822 therefore elects to generate both a statement type user prompt and a question type user prompt at 824 to solicit further user input. The statement type user prompt “That looks casual.” denotes the conflicting results between the visual search and the knowledge graph search. The statement explains to the user that the results of the different search types have been recognized as contradictory.

The question type prompt asks the user to consider “more elegant” choices. This behavior probably arises because the term “elegant” is associated in the knowledge graph historical data with prior buyer searches and/or seller listings for “formal” and “dinner party” dresses. The knowledge graph may contain sufficiently strong attribute values that link “formal” and “dinner party” to particular candidate products. That is, because many sellers historically described these particular candidate products using the terms “formal” or “dinner party”, they may be recognized as strong candidates. Similarly, if many buyers historically used “formal” or “dinner party” frequently to find these particular candidate products, that also may strengthen the linkage between the user utterance terms and the user input query image.

The “bot” therefore elects to provide product suggestion type user prompts to determine the user's true intent, perhaps in spite of some user misunderstanding. Candidate products 1008, 1010, and 1012 are displayed for the user because there is sufficient confidence that the user will proceed to make a final product selection, rather than providing further utterances. This example therefore indicates that the intelligent online personal assistant can provide helpful expert guidance via a variety of output types, when necessary. That is, the user prompts may comprise selectable candidate product images for the highest ranked candidate products, or questions requesting further natural input data describing unresolved aspect values.

FIG. 11 shows a visual product search that generates product suggestions in response to user category input, according to some example embodiments. In this example, a user provides an input query image 1102 of a tent, along with circled suggested categories (“Outdoor Play Tents” and “Camping Tents”). The knowledge graph provides a leaf category prediction in block 1104, based on historical data, e.g., denoting that past sellers and buyers in aggregate have frequently used the sub-category of “Dome Camping Tent” to find items characterized by the suggested categories.

The visual search service 800 may perform a visual search based on the image signature it generates for input query image 1102, in comparison with image signatures of candidate product images in database 814. The result is an internal ranked list of potentially matching items 818, as previously described.

However, the interaction between the knowledge graph service 822 and the visual search service 800 may have led in this case to a suggested product other than the one either service may have selected separately. Thus, the filtered result in this case is a candidate product that both bears the predicted leaf category and has a candidate product image that very closely matches the provided input query image 1102. The “bot” therefore has such confidence (e.g., weighted support) that the filtered result satisfies the user's search interest that the dialog manager 216 elects to output a single product suggestion rather than engage in further dialog turns with the user.

FIG. 12 shows a flowchart of a methodology 1200 for visual search service and knowledge graph service interaction, according to some example embodiments. The methodology may process natural language input and an image query to generate an item recommendation in an electronic marketplace. This methodology may be implemented via the structural elements previously described, as well as via instructions executed by a processor in a computing machine.

At 1202, the methodology may generate an image signature for each of a plurality of candidate product images and an input query image, where the respective image signatures semantically represent respective visual content features. The image signature generation may be performed by a neural network. At 1204, the methodology may calculate a visual similarity measure between each candidate product image and the input query image based on the corresponding image signatures.

At 1206, the methodology may assemble a ranked list of candidate products based on the respective visual similarity measures. That is, the visual similarity measure may serve as a search result score, and candidate products that are the most highly scored may be presented first. The candidate products may have various resolved categories, aspects, and aspect values associated with them, whether determined from a visual search process or a non-visual search process.

At 1208, for each candidate product in the ranked list, the methodology may provide its corresponding resolved category, aspects, and aspect values to a knowledge graph. The knowledge graph may contain aggregate historical electronic marketplace user interaction information, such as which categories, aspects, and/or aspect values are associated with a given item. These categories, aspects, and/or aspect values may help relate an input query image to a product of interest to the user even if they are not discernible via visual search.

At 1210, the methodology may generate and output a user prompt that requests further user input in a multi-turn dialog. The user prompt may solicit information used for selecting the next most useful unresolved aspect values in the knowledge graph, for the most highly ranked candidate products. By combining visual search information and knowledge graph based information, embodiments may iteratively filter the list of candidate products to refine the user query.
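This description does not specify how the next most useful unresolved aspect is chosen. Purely as an assumed heuristic for illustration, the sketch below picks the unresolved aspect whose values vary the most across the top-ranked candidates, on the reasoning that an answer to that question would filter the list the most. The function name `next_aspect_to_ask` and the sample data are hypothetical.

```python
def next_aspect_to_ask(candidates, resolved):
    """Pick the unresolved aspect with the most distinct values among candidates.

    candidates: list of dicts mapping aspect name -> aspect value
    resolved:   set of aspect names the user has already specified
    Returns None when no remaining aspect would distinguish the candidates.
    """
    values_seen = {}
    for aspects in candidates:
        for name, value in aspects.items():
            if name not in resolved:
                values_seen.setdefault(name, set()).add(value)
    # An aspect with a single shared value cannot narrow the list further.
    informative = {n: v for n, v in values_seen.items() if len(v) > 1}
    if not informative:
        return None
    # Sorting the names first makes tie-breaking deterministic (alphabetical).
    return max(sorted(informative), key=lambda n: len(informative[n]))


top_candidates = [
    {"color": "red", "style": "formal", "material": "silk"},
    {"color": "red", "style": "casual", "material": "cotton"},
    {"color": "red", "style": "formal", "material": "linen"},
]
prompt_aspect = next_aspect_to_ask(top_candidates, resolved={"style"})
```

In this example every candidate is red, so asking about color would not help, and the sketch instead selects "material" as the aspect to ask about.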

At 1212, the methodology may repeat the previously described visual search and user prompt operations to refine the ranked list of candidate products to better meet the user's interests. This methodology thus effectively transforms both the input query image and natural language user prompt information, in combination, into a list of the most relevant products.

FIG. 13 shows an offline index generation service 1306 that interacts with an on-demand visual search service 800, according to some example embodiments. The present inventors have recognized, among other things, that offline processing of an immense product inventory may help produce data needed for a subsequent online search request. New products/items 1302 may be provided to an electronic marketplace, as items to be sold for example. Millions of such items may be provided per week for some electronic marketplaces.

Therefore, the images provided for the new items 1302 may be assembled with item metadata at block 1308. Item metadata may include for example a product identification (ID) number and a uniform resource locator (URL) assigned to an item listing page on the electronic marketplace's web site. Each product, which may be a candidate product of interest to a user at some point, may be processed to create descriptive image signatures for one, several, or all of the images provided for it by a seller, for example. Each image signature may comprise a binary hash of a number of floating point numbers that describe various aspects of image content. Thus each image signature is not merely a compressed version of a product image, but instead comprises a concise semantic summarization of its predominant content features.
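The hashing scheme that turns floating point feature values into a binary signature is not detailed here. As one simple assumed approach, each feature value can be compared against a threshold to yield one bit. The function name `binarize`, the threshold of zero, and the sample feature vector below are illustrative assumptions only.

```python
def binarize(features, threshold=0.0):
    """Pack one bit per feature into an int: 1 if the value exceeds the threshold.

    Thresholding is an assumed stand-in for whatever hashing an
    embodiment actually applies to the neural network output.
    """
    signature = 0
    for value in features:
        signature = (signature << 1) | (1 if value > threshold else 0)
    return signature


# Hypothetical four-dimensional feature vector from an image.
features = [0.9, -0.2, 0.1, -0.7]
signature = binarize(features)
```

The first feature maps to the most significant bit, so the vector above yields the bit pattern 1010. Because each bit records only whether a feature is present strongly enough, the signature summarizes content rather than compressing pixel data.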

A bulk image signature generation block is shown as item 1306. This block may comprise neural network software operating on a server farm to process a vast number of input images in parallel to rapidly produce an image signature for each. Block 1306 may process new item images on a periodic or more continuous ongoing basis. For a large electronic marketplace, a daily batch execution may process all of the images provided for all newly listed products, for example.

The product images, image signatures, and metadata may be combined and formatted by index generator 1308. The descriptive information for the new “live” products may then be put into an inventory database 814 that may for example contain the entire inventory for the electronic marketplace. The inventory database 814 may be centralized, or it may be replicated to multiple processing sites in a distributed computer network. Local availability of a copy of the inventory database 814 may help a local user view product listings on the electronic marketplace with less network latency, for example.

An online user who is interested in finding a particular product may then submit an image query 802, to be processed on demand. Image query 802 may comprise a number of input query images. Block 1310 may generate image signatures on demand for the input query images, for example following the same image signature format used by the electronic marketplace for the new product listing images.

The visual search service 800 may then calculate, on demand, a visual similarity measure between each candidate product image and the input query image based on the respective corresponding image signatures. The visual search service 800 may comprise neural network software executing on a number of hardware processors, for example. The visual search service 800 does not need to compute image signatures for the products available for sale because those have already been generated and stored in the database 814. The combined offline product image processing and the online query image processing may therefore perform a visual query for a user without significant user-discernible delay. This approach to product searching may therefore improve the efficiency of user product searching, and thus effectively improve the operation of the electronic marketplace computer system as a whole.

FIG. 14 shows a flowchart of a methodology 1400 for offline index generation for visual search, according to some example embodiments. The methodology may identify a candidate product in an electronic marketplace based on an on-demand visual comparison between candidate product image visual content and visual content of an image query. This methodology may be implemented via the structural elements previously described, as well as via instructions executed by a processor in a computing machine.

At 1402, the methodology may generate an image signature for each candidate product image (or a selected subset), where the respective image signatures semantically represent the candidate product image visual content features. The image signature generation may be performed by a neural network. At 1404, the methodology may store, in a database, the image signature along with a product identification (ID) number and a uniform resource locator (URL) associated with each candidate product image. The generating and storing may be performed for new products currently listed in the electronic marketplace on a periodic basis or a more continuous ongoing basis.
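The storing operation at 1404 can be sketched with an in-memory SQLite table that holds each signature alongside its product ID and URL. The table name `image_index`, the column layout, the helper function names, and the example.com URL are all assumptions made for this illustration; no particular database or schema is prescribed here.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a hypothetical image signature index."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS image_index ("
        "product_id TEXT NOT NULL, url TEXT NOT NULL, signature BLOB NOT NULL)"
    )
    return conn


def store_signature(conn, product_id, url, signature: bytes):
    """Store one image signature with its product ID and listing URL."""
    conn.execute("INSERT INTO image_index VALUES (?, ?, ?)",
                 (product_id, url, signature))
    conn.commit()


def load_signatures(conn):
    """Return every stored (product_id, url, signature) row."""
    return conn.execute(
        "SELECT product_id, url, signature FROM image_index").fetchall()


conn = open_index()
store_signature(conn, "101", "https://example.com/item/101", bytes([0b10110010]))
rows = load_signatures(conn)
```

The product ID is deliberately not a unique key in this sketch, since a product may have several images and each image may receive its own signature row.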

Since an electronic marketplace may have millions of products/items listed at any given time, the generating and storing may be performed by a plurality of processors (e.g., in a server farm) operating in parallel. Each candidate product may have multiple candidate product images associated with it, and each such image may be similarly processed. The database may be replicated at multiple processing sites in a distributed computer network, for example to provide faster access to users in a given geographic region.
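The batch structure of such parallel processing can be sketched on a single machine with a worker pool. In the example below, `compute_signature` is a hypothetical placeholder (a byte sum) standing in for the neural network, and a thread pool stands in for the multiple processors of a server farm; neither reflects an actual embodiment.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_signature(image_bytes: bytes) -> int:
    """Placeholder for the neural network signature generator.

    A byte sum is NOT a meaningful image signature; it only gives the
    workers something deterministic to compute in this sketch.
    """
    return sum(image_bytes) % 256


def batch_signatures(images: dict, max_workers: int = 4) -> dict:
    """Compute a signature for every image, keyed by image identifier."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so zip pairs them correctly.
        signatures = pool.map(compute_signature, images.values())
        return dict(zip(images.keys(), signatures))


# Hypothetical image payloads keyed by image identifier.
new_images = {"img-a": b"\x01\x02", "img-b": b"\xff\x02"}
signature_by_image = batch_signatures(new_images)
```

Because each image is processed independently of the others, the same structure extends to separate processes or separate machines without changing the per-image work.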

At 1406, the methodology may generate a further image signature for an input query image. This input query image signature may be generated by a neural network, and may be generated “online” or on demand, with little or no user-discernible delay. At 1408, the methodology may calculate a visual similarity measure between each candidate product image and the input query image. Although described as processing each candidate product image here, the methodology may incorporate other operations that very effectively reduce the product search space so that a more relevant subset of the candidate product images may be compared to the input query image via the respective image signatures.

At 1410, the methodology may output a ranked list of candidate products based on the visual similarity measure. That is, the visual similarity measure may serve as a search result score, and candidate products that are the most highly scored may be presented first. At 1412, the methodology may update the database with the image signature for the input query image, even if the input query image is not of an item being placed for sale on the electronic marketplace. Thus, although the database may encompass information regarding some or all of the items available for sale on the electronic marketplace, this feature is exemplary and not limiting. This methodology thus effectively transforms the input query image into a list of the most relevant products.

The specific ordered combinations described herein may improve the overall operation of a computer system used to help a user find products of interest in an electronic marketplace. The input images (e.g., candidate product images and query images) are effectively transformed into a ranked list of products that may most satisfy a user's shopping interests. The overall speed of searching is improved to reduce the time that a user has to spend while still searching a given (typically large) number of candidate products, compared with purely text-based search approaches of the past. The number of search iterations may also be reduced via the specialized approaches described, so that overall efficiency of user interaction with an electronic marketplace is measurably increased.

Although the subject matter has been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the disclosed subject matter. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense. The accompanying drawings that form a part hereof show, by way of illustration and not of limitation, specific embodiments in which the subject matter may be practiced. The embodiments illustrated are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed herein. Other embodiments may be utilized and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. This Description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments is defined only by any appended claims, along with the full range of equivalents to which such claims are entitled.

Such embodiments of the inventive subject matter may be referred to herein, individually and/or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept if more than one is in fact disclosed. Thus, although specific embodiments have been illustrated and described herein, it should be appreciated that any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the above description.

What is claimed is:
 1. A method comprising: generating a plurality of candidate product image signatures for a plurality of candidate product images; generating an input query image signature for an input query image; calculating a plurality of visual similarity measures between the plurality of candidate product image signatures and the input query image signature; and outputting a ranked list of the plurality of candidate product images based on the plurality of visual similarity measures.
 2. The method of claim 1, further comprising: storing, in a database, a product image signature of the plurality of candidate product image signatures in association with a product identification number associated with a candidate product image of the plurality of candidate product images.
 3. The method of claim 2, wherein the storing is performed for a new product listed in an electronic marketplace.
 4. The method of claim 1, further comprising: storing, in a database, a product image signature of the plurality of candidate product image signatures in association with a uniform resource locator associated with a candidate product image of the plurality of candidate product images.
 5. The method of claim 4, wherein the storing is performed for a new product listed in an electronic marketplace.
 6. The method of claim 1, wherein the plurality of candidate product image signatures semantically represent the plurality of candidate product images.
 7. The method of claim 1, wherein the input query image is associated with a product listed in an electronic marketplace.
 8. A non-transitory computer-readable storage medium comprising instructions that, when executed by a processor, cause the processor to perform operations comprising: generating a plurality of candidate product image signatures for a plurality of candidate product images; generating an input query image signature for an input query image; calculating a plurality of visual similarity measures between the plurality of candidate product image signatures and the input query image signature; and outputting a ranked list of the plurality of candidate product images based on the plurality of visual similarity measures.
 9. The non-transitory computer-readable storage medium of claim 8, wherein the operations further comprise: storing, in a database, a product image signature of the plurality of candidate product image signatures in association with a product identification number associated with a candidate product image of the plurality of candidate product images.
 10. The non-transitory computer-readable storage medium of claim 9, wherein the storing is performed for a new product listed in an electronic marketplace.
 11. The non-transitory computer-readable storage medium of claim 8, wherein the operations further comprise: storing, in a database, a product image signature of the plurality of candidate product image signatures in association with a uniform resource locator associated with a candidate product image of the plurality of candidate product images.
 12. The non-transitory computer-readable storage medium of claim 11, wherein the storing is performed for a new product listed in an electronic marketplace.
 13. The non-transitory computer-readable storage medium of claim 8, wherein the plurality of candidate product image signatures semantically represent the plurality of candidate product images.
 14. The non-transitory computer-readable storage medium of claim 8, wherein the input query image is associated with a product listed in an electronic marketplace.
 15. A system comprising: a processor; and a memory comprising instructions that, when executed by the processor, cause the processor to perform operations comprising: generating a plurality of candidate product image signatures for a plurality of candidate product images; generating an input query image signature for an input query image; calculating a plurality of visual similarity measures between the plurality of candidate product image signatures and the input query image signature; and outputting a ranked list of the plurality of candidate product images based on the plurality of visual similarity measures.
 16. The system of claim 15, wherein the operations further comprise: storing, in a database, a product image signature of the plurality of candidate product image signatures in association with a product identification number associated with a candidate product image of the plurality of candidate product images.
 17. The system of claim 16, wherein the storing is performed for a new product listed in an electronic marketplace.
 18. The system of claim 15, wherein the operations further comprise: storing, in a database, for a new product listed in an electronic marketplace, a product image signature of the plurality of candidate product image signatures in association with a uniform resource locator associated with a candidate product image of the plurality of candidate product images.
 19. The system of claim 18, wherein the storing is performed for a new product listed in an electronic marketplace.
 20. The system of claim 15, wherein the plurality of candidate product image signatures semantically represent the plurality of candidate product images.